library(tidyverse)
library(sn)
library(knitr)
library(altair)
library(reticulate)
alt$data_transformers$disable_max_rows()
# Set basic theme for plots
theme_hcb <- theme_classic() +
theme(panel.grid.major.y=element_line(colour = 'lightgrey')
)
theme_set(theme_hcb)
The best way to calculate the rate of death from an illness is to look at the outcome for each confirmed case but data with this level of detail isn’t always released publicly. NSW Health publishes daily cases and deaths for COVID as well as some summary statistics in a weekly surveillance report. The death rate can be estimated using the daily figures but it is necessary to account for the lag time between testing positive and passing away. This is pretty straightforward, but what should the lag be?
The weekly surveillance reports published by NSW Health3 state a median lag of 11 days from symptom onset to death. This has been consistent throughout the second half of 2021 where Delta was the dominant variant in NSW. It is also possible to calculate the observed lag by comparing case numbers and death numbers. This method shows a lag of 20 days2. This is a big difference! The observed lag is almost double the median.
These numbers point to a flat, heavily skewed distribution1. It makes sense that it could be approximated with a series of Poisson distributions, but without more information it’s just guesswork. It’s easier (and enough to show the effects) to define a distribution that has some of the observed properties and see what happens when it is applied to real world data. This should help explain the difference in the median lag and the observed lag and inform the choice of lag time.
# Skewed distribution to approximate observed data
set.seed(55555)
# Set shape parameters
om=15
al=6
del=1
sn = rsn(n=10000, xi=del, omega=om, alpha=al)
sn = sn[sn>=0]
A skewed normal distribution was used to approximate the data. The median was set to match the NSW Health figure of 11 days, with a shape resembling those published by UK Health1. NB: This distribution is for the original strain.
# Plot approximation distribution
SN=data.frame(sn)
ggplot(data = SN, mapping = aes(x=sn)) +
geom_histogram(
color='white',
fill='skyblue',
bins=50
) +
labs(
x='Days after testing positive',
y='Count',
title='Approximate distribution of days until death'
)

Summary statistics for the model distribution:
# Summary stats for generated data
kable(as.matrix(t(summary(sn))))
| 0.0060008 |
6.236993 |
11.4796 |
13.28834 |
18.57076 |
65.27701 |
This corresponds to the distribution of deaths from one day of infections.
Using the case number data from NSW Health (left graph), it should be possible to re-create the general shape of the death data. For each day a random sample is taken from the model distribution, essentially simulating when the deaths would occur. These are first laid over each other (top right graph) and then stacked on top of each other (bottom right graph).
The purpose here is to show how the difference in lag occurs, so the overall shapes are more important than the number totals. Due to the relatively small number of cases each day, rather than take a percentage of cases to reflect the death rate no scaling has been done.
Selecting a day on the left graph shows where the deaths from that day would likely occur. The long right tail of the model distribution pushes the peak of the stacked histogram to the right.
# Read in case data and select NSW
cases = read.csv('cases_daily_state_NET.csv')
cases <- subset(cases, select=c(ï..Date,NSW))
# Convert date format and select time period of interest
cases <- rename(cases, Date=ï..Date)
cases$Date <- as.Date(cases$Date, '%d/%m/%y')
cases <- cases %>% filter(Date>'2021-06-15' & Date<'2021-10-31')
# Convert count format
cases$NSW <- as.numeric(gsub(",", "", cases$NSW))
# Make data frame with distributions
DS = data.frame(pos=integer(),day=character())
for(i in 1:length(cases$NSW))
{
# generate distribution for that days positive tests + num days since start
pos_tests = cases$NSW[i]
if (pos_tests > 0) {
d_dist <- rsn(n=pos_tests, xi=del, omega=om, alpha=al)
d_dist <- d_dist[d_dist>=0]+i
d_dist <- data.frame(d_dist, i) %>% rename(deaths=d_dist,day=i)
DS <- rbind(DS, d_dist)
}
}
# Convert day of infection and death to date - Altair needs timestamp
# Set start date
start <- as.POSIXct(as.Date('2021-06-15'))
# Define one day
plus1 <- as.POSIXct(as.Date('2021-06-16'))
one_day <- plus1-start
# Add to DS frame
DS$Date <- start + DS$day*one_day
DS$d_date <- start + floor(DS$deaths)*one_day
DS <- subset(DS, select = c(deaths, Date))
# Set number of bins for histograms
num_bins <- ceiling(max(DS$deaths)-min(DS$deaths))
# Extra entry - hack to extend x-axis for case histogram
# Altair doesn't do this nicely with datetimes
# NB there are 187 days from 16/6 - 20/12
#end <- as.POSIXct(as.Date('2021-12-20'))
#row <- data.frame(179,end)
#names(row) <- c("deaths", "Date")
#DS <- rbind(DS, row)
### Build interactive charts
# Define interactive plot selector
selector <- alt$selection_single(empty = "all", fields = list("Date"))
# Histogram of cases with line and date of max cases
h_case <-
alt$Chart(DS)$
mark_bar(size=2)$
encode(
x = alt$X(
"Date:T",
axis=alt$Axis(
title='Date',
grid=F
),
timeUnit="monthdate"
),
y = "count(Date):Q",
color = alt$condition(
selector,
"Date:T",
legend=NULL,
scale=alt$Scale(scheme=alt$SchemeParams(name='viridis', extent=c(-0.5,1.5))),
alt$value("lightgray")
)
)$
properties(
selection = selector,
title='Daily cases - NSW'
)
# Create dataframe with mode value for line
m_case <- as.POSIXct(as.Date('2021-09-11'))
# Put vertical line showing mode on graph
l_case <-
alt$Chart(data.frame(m_case))$
mark_rule(
color='darkgrey',
strokeDash=array(1,1)
)$
encode(x='monthdate(m_case):T')
# Create text dataframe
td_case <- '11-Sep'
t_case <-
alt$Chart(data.frame(td_case))$
mark_text(
dx=52,
dy=-110
)$
encode(text='td_case:N')$
properties(width = 350, height = 200)
# Combine layers
d_case <- h_case + l_case + t_case
### Distribution plots
# Overlaid histograms
h_over <-
alt$Chart(DS)$
mark_area(interpolate="step")$
encode(
x = alt$X(
"deaths",
bin=alt$Bin(maxbins=num_bins),
scale = alt$Scale(domain = c(0, 187)),
axis=alt$Axis(title='Days since 16th June 2021')
),
y = alt$Y(
"count():Q",
stack=F,
scale = alt$Scale(domain = c(0, 100))
),
color = alt$Color(
"Date:T",
legend=NULL
),
opacity=alt$condition(
selector,
alt$value(1),
alt$value(0.05)
)
)$
add_selection(selector)$
properties(
selection = selector,
title='Date of death histograms - overlaid'
)
# Create dataframe with mode value for line
m_over <- 95
l_over <-
alt$Chart(data.frame(m_over))$
mark_rule(
color='darkgrey',
strokeDash=array(1,1)
)$
encode(x='m_over:Q')
# Create text dataframe
td_over <- '19-Sep'
t_over <-
alt$Chart(data.frame(td_over))$
mark_text(
dx=-11,
dy=-110
)$
encode(text='td_over:N')$
properties(width = 350, height = 200)
# Combine layers
d_over <- h_over + l_over + t_over
# Stacked histograms
h_stak <-
alt$Chart(DS)$
mark_area(interpolate="step")$
encode(
x = alt$X(
"deaths",
bin=alt$Bin(maxbins=num_bins),
scale = alt$Scale(domain = c(0, 187)),
axis=alt$Axis(title='Days since 16th June 2021')
),
y = alt$Y(
"count()",
stack=T,
scale = alt$Scale(domain = c(0, 1250))
),
color = alt$Color(
"Date:T",
legend=NULL
),
opacity=alt$condition(
selector,
alt$value(1),
alt$value(0.3)
)
)$
add_selection(selector)$
properties(
selection = selector,
title='Date of death histograms - stacked'
)
# Create dataframe with mode value for line
m_stak <- 100
l_stak <-
alt$Chart(data.frame(m_stak))$
mark_rule(
color='darkgrey',
strokeDash=array(1,1)
)$
encode(x='m_stak:Q')
# Create text dataframe
td_stak <- '24-Sep'
t_stak <-
alt$Chart(data.frame(td_stak))$
mark_text(
dx=-2,
dy=-110
)$
encode(text='td_stak:N')$
properties(width = 350, height = 200)
d_stak <- h_stak + l_stak + t_stak
# Combine all graphs
final <- d_case | (d_over & d_stak)
final$configure_title(anchor='start')
NA
When the distributions are overlaid the highest point is at 19th September - an 8 day lag. When they stacked the peak is the 24th September - a 13 day lag. Clearly the model distribution here is not quite right, but it serves to demonstrate how all the distributions relate to each other. It is also clearer now that using the median as the lag will produce a less accurate estimation of the death rate when looking at the deaths reported each day. In practice it will under-estimate as cases increase and over-estimate as cases decrease. Aligning the peaks of the case numbers and the daily deaths will give the best approximation.
---
title: "COVID death lag discussion"
output: 
  html_notebook:
    df_print: kable
---
```{r results='hide'}
library(tidyverse)
library(sn)
library(knitr)
library(altair)
library(reticulate)
alt$data_transformers$disable_max_rows()

# Set basic theme for plots
theme_hcb <- theme_classic() +
             theme(panel.grid.major.y=element_line(colour = 'lightgrey')
              )
theme_set(theme_hcb)
```
The best way to calculate the rate of death from an illness is to look at the outcome for each confirmed case but data with this level of detail isn't always released publicly. NSW Health publishes daily cases and deaths for COVID as well as some summary statistics in a weekly surveillance report. The death rate can be estimated using the daily figures but it is necessary to account for the lag time between testing positive and passing away. This is pretty straightforward, but what should the lag be? 

The weekly surveillance reports published by NSW Health<sup>3</sup> state a median lag of 11 days from symptom onset to death. This has been consistent throughout the second half of 2021 where Delta was the dominant variant in NSW. It is also possible to calculate the observed lag by comparing case numbers and death numbers. This method shows a lag of 20 days<sup>2</sup>. This is a big difference! The observed lag is almost double the median.

These numbers point to a flat, heavily skewed distribution<sup>1</sup>. It makes sense that it could be approximated with a series of Poisson distributions, but without more information it's just guesswork. It's easier (and enough to show the effects) to define a distribution that has some of the observed properties and see what happens when it is applied to real world data. This should help explain the difference in the median lag and the observed lag and inform the choice of lag time.

```{r}
# Skewed distribution to approximate observed data
set.seed(55555)
# Set shape parameters
om=15
al=6
del=1
sn = rsn(n=10000, xi=del, omega=om, alpha=al)
sn = sn[sn>=0]
```
A skewed normal distribution was used to approximate the data. The median was set to match the NSW Health figure of 11 days, with a shape resembling those published by UK Health<sup>1</sup>. NB: This distribution is for the original strain.
```{r fig.width=8,fig.height=2}
# Plot approximation distribution
SN=data.frame(sn)
ggplot(data = SN, mapping = aes(x=sn)) +
  geom_histogram(
    color='white',
    fill='skyblue',
    bins=50
  ) +
  labs(
    x='Days after testing positive', 
    y='Count', 
    title='Approximate distribution of days until death'
  )
```
Summary statistics for the model distribution:
```{r results='asis'}
# Summary stats for generated data
kable(as.matrix(t(summary(sn))))
```

This corresponds to the distribution of deaths from one day of infections. 

Using the case number data from NSW Health (left graph), it should be possible to re-create the general shape of the death data. For each day a random sample is taken from the model distribution, essentially simulating when the deaths would occur. These are first laid over each other (top right graph) and then stacked on top of each other (bottom right graph). 

The purpose here is to show how the difference in lag occurs, so the overall shapes are more important than the number totals. Due to the relatively small number of cases each day, rather than take a percentage of cases to reflect the death rate no scaling has been done.

Selecting a day on the left graph shows where the deaths from that day would likely occur. The long right tail of the model distribution pushes the peak of the stacked histogram to the right.

```{r fig.width=12,fig.height=10}
# Read in case data and select NSW
cases = read.csv('cases_daily_state_NET.csv')
cases <- subset(cases, select=c(ï..Date,NSW))
# Convert date format and select time period of interest
cases <- rename(cases, Date=ï..Date)
cases$Date <- as.Date(cases$Date, '%d/%m/%y')
cases <- cases %>% filter(Date>'2021-06-15' & Date<'2021-10-31')
# Convert count format
cases$NSW <- as.numeric(gsub(",", "", cases$NSW))

# Make data frame with distributions
DS = data.frame(pos=integer(),day=character())
for(i in 1:length(cases$NSW))
{
  # generate distribution for that days positive tests + num days since start
  pos_tests = cases$NSW[i]
  if (pos_tests > 0) {
    d_dist <- rsn(n=pos_tests, xi=del, omega=om, alpha=al)
    d_dist <- d_dist[d_dist>=0]+i
    d_dist <- data.frame(d_dist, i) %>% rename(deaths=d_dist,day=i)
    DS <- rbind(DS, d_dist)
  }
}

# Convert day of infection and death to date - Altair needs timestamp
# Set start date
start <- as.POSIXct(as.Date('2021-06-15'))
# Define one day
plus1 <- as.POSIXct(as.Date('2021-06-16'))
one_day <- plus1-start
# Add to DS frame
DS$Date <- start + DS$day*one_day
DS$d_date <- start + floor(DS$deaths)*one_day
DS <- subset(DS, select = c(deaths, Date))
# Set number of bins for histograms
num_bins <- ceiling(max(DS$deaths)-min(DS$deaths))
# Extra entry - hack to extend x-axis for case histogram
# Altair doesn't do this nicely with datetimes
# NB there are 187 days from 16/6 - 20/12
#end <- as.POSIXct(as.Date('2021-12-20'))
#row <- data.frame(179,end)
#names(row) <- c("deaths", "Date")
#DS <- rbind(DS, row)

### Build interactive charts
# Define interactive plot selector
selector <- alt$selection_single(empty = "all", fields = list("Date"))

# Histogram of cases with line and date of max cases
h_case <- 
  alt$Chart(DS)$
  mark_bar(size=2)$
  encode(
    x = alt$X(
      "Date:T", 
      axis=alt$Axis(
        title='Date',
        grid=F
      ),
      timeUnit="monthdate"
    ),
    y = "count(Date):Q",
    color = alt$condition(
      selector,
      "Date:T", 
      legend=NULL,
      scale=alt$Scale(scheme=alt$SchemeParams(name='viridis', extent=c(-0.5,1.5))),
      alt$value("lightgray")
    )
  )$
  properties(
    selection = selector,
    title='Daily cases - NSW'
  )

# Create dataframe with mode value for line
m_case <- as.POSIXct(as.Date('2021-09-11'))
# Put vertical line showing mode on graph
l_case <- 
  alt$Chart(data.frame(m_case))$
  mark_rule(
    color='darkgrey',
    strokeDash=array(1,1)
  )$
  encode(x='monthdate(m_case):T')

# Create text dataframe
td_case <- '11-Sep'
t_case <- 
  alt$Chart(data.frame(td_case))$
  mark_text(
    dx=52,
    dy=-110
  )$
  encode(text='td_case:N')$
  properties(width = 350, height = 200)

# Combine layers
d_case <- h_case + l_case + t_case


### Distribution plots
# Overlaid histograms
h_over <- 
  alt$Chart(DS)$
  mark_area(interpolate="step")$
  encode(
    x = alt$X(
      "deaths", 
      bin=alt$Bin(maxbins=num_bins), 
      scale = alt$Scale(domain = c(0, 187)),
      axis=alt$Axis(title='Days since 16th June 2021')
    ),
    y = alt$Y(
      "count():Q", 
      stack=F, 
      scale = alt$Scale(domain = c(0, 100))
    ),
    color = alt$Color(
      "Date:T", 
      legend=NULL
    ),
    opacity=alt$condition(
      selector,
      alt$value(1),
      alt$value(0.05)
    )
  )$
  add_selection(selector)$
  properties(
    selection = selector, 
    title='Date of death histograms - overlaid'
  )

# Create dataframe with mode value for line
m_over <- 95
l_over <- 
  alt$Chart(data.frame(m_over))$
  mark_rule(
    color='darkgrey',
    strokeDash=array(1,1)
  )$
  encode(x='m_over:Q')

# Create text dataframe
td_over <- '19-Sep'
t_over <- 
  alt$Chart(data.frame(td_over))$
  mark_text(
    dx=-11,
    dy=-110
    )$
  encode(text='td_over:N')$
  properties(width = 350, height = 200)

# Combine layers
d_over <- h_over + l_over + t_over


# Stacked histograms
h_stak <- 
  alt$Chart(DS)$
  mark_area(interpolate="step")$
  encode(
    x = alt$X(
      "deaths", 
      bin=alt$Bin(maxbins=num_bins), 
      scale = alt$Scale(domain = c(0, 187)),
      axis=alt$Axis(title='Days since 16th June 2021')
    ),
    y = alt$Y(
      "count()",
      stack=T,
      scale = alt$Scale(domain = c(0, 1250))
    ),
    color = alt$Color(
      "Date:T", 
      legend=NULL
    ),
    opacity=alt$condition(
      selector,
      alt$value(1),
      alt$value(0.3)
    )
  )$
  add_selection(selector)$
  properties(
    selection = selector, 
    title='Date of death histograms - stacked'
  )

# Create dataframe with mode value for line
m_stak <- 100
l_stak <- 
  alt$Chart(data.frame(m_stak))$
  mark_rule(
    color='darkgrey',
    strokeDash=array(1,1)
  )$
  encode(x='m_stak:Q')

# Create text dataframe
td_stak <- '24-Sep'
t_stak <- 
  alt$Chart(data.frame(td_stak))$
  mark_text(
    dx=-2,
    dy=-110
  )$
  encode(text='td_stak:N')$
  properties(width = 350, height = 200)

d_stak <- h_stak + l_stak + t_stak

# Combine all graphs
final <- d_case | (d_over & d_stak)
final$configure_title(anchor='start')

```
When the distributions are overlaid the highest point is at 19th September - an 8 day lag. When they stacked the peak is the 24th September - a 13 day lag. Clearly the model distribution here is not quite right, but it serves to demonstrate how all the distributions relate to each other. It is also clearer now that using the median as the lag will produce a less accurate estimation of the death rate when looking at the deaths reported each day. In practice it will under-estimate as cases increase and over-estimate as cases decrease. Aligning the peaks of the case numbers and the daily deaths will give the best approximation.

<br><br><br>

#### References
1. COVID-19 Clinical Information Network (CO-CIN), <i>Time from symptom onset until death in UK hospitalised patients</i>, 7 October 2020, <https://www.gov.uk/government/publications/co-cin-covid-19-time-from-symptom-onset-until-death-in-uk-hospitalised-patients-7-october-2020>

2. Mike Honey 2021, accessed 23 December 2021, <https://github.com/Mike-Honey/covid-19-hospitalisations>

3. NSW Ministry of Health 2021, accessed 23 December 2021, <https://www.health.nsw.gov.au/Infectious/covid-19/Pages/weekly-reports.aspx>

Case data from <a href="https://infogram.com/1py2ljn6jl6q32b3lr1mnxn73gcy06qnjl2?live">covid19data.com.au</a>